Code
library(tidyverse)
library(ggridges)
library(here)
library(ghibli)The questions in this lab are noted with numbers and boldface. Each question will require you produce code, whether it is one line or multiple lines.
This document is quite “plain,” meaning it does not have any special formatting. As part of your demonstration of working with Quarto documents, I would encourage you to spice your documents up (e.g., declaring execution options, specifying how your figures should be output, formatting your code output).
Part of learning to program is learning from a variety of resources. Thus, I expect you will use resources beyond the textbook used for this course. However, there is an important balance between copying someone else’s code and using their code to learn. The course syllabus defines what is considered plagiarism in this course. Essentially, if you use external resources, I want to know about it. You can “inform” me of any resources you used by pasting the link to the resource in a code comment next to where you used that resource.
You are permitted and encouraged to work with your teammates as you complete the lab assignment, but you are expected to do your own work. Copying from each other is cheating, and letting people copy from you is also cheating. Don’t do either of those things.
In the code chunk below load in the packages necessary for your analysis. You should only need the tidyverse and here packages for this analysis, unless you decide to use additional resources.
library(tidyverse)
library(ggridges)
library(here)
library(ghibli)The Portal Project is a long-term ecological study being conducted near Portal, AZ. Since 1977, the site has been used to study the interactions among rodents, ants and plants and their respective responses to climate. To study the interactions among organisms, we experimentally manipulate access to 24 study plots. This study has produced over 100 scientific papers and is one of the longest running ecological studies in the U.S.
We will be investigating the animal species diversity and weights found within plots at the Portal study site. The dataset is stored as a comma separated value (CSV) file. Each row holds information for a single animal, and the columns represent:
| Column | Description |
|---|---|
| record_id | Unique id for the observation |
| month | month of observation |
| day | day of observation |
| year | year of observation |
| plot_id | ID of a particular plot |
| species_id | 2-letter code |
| sex | sex of animal (“M”, “F”) |
| hindfoot_length | length of the hindfoot in mm |
| weight | weight of the animal in grams |
| genus | genus of animal |
| species | species of animal |
| taxon | e.g. Rodent, Reptile, Bird, Rabbit |
| plot_type | type of plot |
We have seen in the practice activity that when importing a dataframe, the columns that contain characters (i.e., text) can be coerced (=converted) into the factor data type. We could set stringsAsFactors to FALSE to avoid this hidden argument to convert our data type.
For this lab we will use the readr package (from the tidyverse) to read in the data. We’ll read in our data using the read_csv() function instead of the read.csv() function. This function does not coerce character variables to factors, a behavior that many in the R community feel is unappealing.
1. Using the read_csv() function and the here package, to write the code necessary to load in the surveys.csv dataset. For simplicity, name the dataset surveys.
# Code for question 1!
surveys <- read_csv(here::here("supporting_artifacts", "learning_targets", "Challenge 2", "surveys.csv"))
surveys# A tibble: 30,463 × 15
record_id month day year plot_id species…¹ sex hindf…² weight date
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <date>
1 63 8 19 1977 3 DM M 35 40 1977-08-19
2 64 8 19 1977 7 DM M 37 48 1977-08-19
3 65 8 19 1977 4 DM F 34 29 1977-08-19
4 66 8 19 1977 4 DM F 35 46 1977-08-19
5 67 8 19 1977 7 DM M 35 36 1977-08-19
6 68 8 19 1977 8 DO F 32 52 1977-08-19
7 69 8 19 1977 2 PF M 15 8 1977-08-19
8 71 8 19 1977 7 DM F 36 35 1977-08-19
9 74 8 19 1977 8 PF M 12 7 1977-08-19
10 75 8 19 1977 8 DM F 32 22 1977-08-19
# … with 30,453 more rows, 5 more variables: day_of_week <chr>,
# plot_type <chr>, genus <chr>, species <chr>, taxa <chr>, and abbreviated
# variable names ¹species_id, ²hindfoot_length
2. What are the dimensions of these data?
dim(surveys)[1] 30463 15
# 30463 rows
# 15 columnsThere are 30,463 rows and 15 columns in the surveys data.
3. What are the data types of the variables in the dataset?
str(surveys)spec_tbl_df [30,463 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ record_id : num [1:30463] 63 64 65 66 67 68 69 71 74 75 ...
$ month : num [1:30463] 8 8 8 8 8 8 8 8 8 8 ...
$ day : num [1:30463] 19 19 19 19 19 19 19 19 19 19 ...
$ year : num [1:30463] 1977 1977 1977 1977 1977 ...
$ plot_id : num [1:30463] 3 7 4 4 7 8 2 7 8 8 ...
$ species_id : chr [1:30463] "DM" "DM" "DM" "DM" ...
$ sex : chr [1:30463] "M" "M" "F" "F" ...
$ hindfoot_length: num [1:30463] 35 37 34 35 35 32 15 36 12 32 ...
$ weight : num [1:30463] 40 48 29 46 36 52 8 35 7 22 ...
$ date : Date[1:30463], format: "1977-08-19" "1977-08-19" ...
$ day_of_week : chr [1:30463] "Fri" "Fri" "Fri" "Fri" ...
$ plot_type : chr [1:30463] "Long-term Krat Exclosure" "Rodent Exclosure" "Control" "Control" ...
$ genus : chr [1:30463] "Dipodomys" "Dipodomys" "Dipodomys" "Dipodomys" ...
$ species : chr [1:30463] "merriami" "merriami" "merriami" "merriami" ...
$ taxa : chr [1:30463] "Rodent" "Rodent" "Rodent" "Rodent" ...
- attr(*, "spec")=
.. cols(
.. record_id = col_double(),
.. month = col_double(),
.. day = col_double(),
.. year = col_double(),
.. plot_id = col_double(),
.. species_id = col_character(),
.. sex = col_character(),
.. hindfoot_length = col_double(),
.. weight = col_double(),
.. date = col_date(format = ""),
.. day_of_week = col_character(),
.. plot_type = col_character(),
.. genus = col_character(),
.. species = col_character(),
.. taxa = col_character()
.. )
- attr(*, "problems")=<externalptr>
The datatypes in the survey dataset is string, character, and date.
ggplot2ggplot() graphics are built step by step by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.
To build a ggplot(), we will use the following basic template that can be used for different types of plots:
ggplot(data = <DATA>, mapping = aes(<VARIABLE MAPPINGS>)) + <GEOM_FUNCTION>()
Let’s get started!
4. First, let’s create a scatterplot of the relationship between weight (on the x-axis) and hindfoot_length (on the y-axis).
# Code for question 4!
ggplot(data = surveys,
mapping = aes(x = weight, y = hindfoot_length)) +
geom_point()We can see there are a lot of points plotted on top of each other. Let’s try and modify this plot to extract more information from it.
5. Let’s add transparency (alpha) to the points, to make the points more transparent and (possibly) easier to see.
#Code for question 5!
ggplot(data = surveys,
mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.2)Well, that is better, but there are still large clumps of data being plotted on top of each other. Let’s try another tool!
6. Add some jitter to the points in the scatterplot, using geom_jitter().
# Code for question 6!
ggplot(data = surveys,
mapping = aes(x = weight, y = hindfoot_length)) +
geom_jitter(alpha = 0.2)Despite our best efforts there is still a substantial amount of overplotting occurring in our scatterplot. Let’s try splitting the dataset into smaller subsets and see if that allows for us to see the trends a bit better.
7. Facet your jittered scatterplot by species.
# Code for question 7
ggplot(data = surveys,
mapping = aes(x = weight, y = hindfoot_length)) +
geom_jitter(alpha = 0.2) +
facet_wrap(~ species)8. Create side-by-side boxplots to visualize the distribution of weight within each species.
# Code for question 8 (and 9)!
ggplot(data = surveys,
mapping = aes(x = weight, y = species)) +
geom_boxplot() +
geom_point(alpha = 0.2, color = "mediumpurple")A fundamental complaint of boxplots is that they do not plot the raw data. However, with ggplot we can add the raw points on top of the boxplots!
9. Add another layer to your previous plot (above) that plots each observation.
Alright, this should look less than optimal. Your points should appear rather stacked on top of each other. To make them less stacked, we need to jitter them a bit, using geom_jitter().
10. Remove the previous layer you had and include a geom_jitter() layer.
That should look much better! But there is another problem! You should notice that in the code above there are both red points and black points. So, some of the observations are being plotted twice!
#code block for Question 10
ggplot(data = surveys,
mapping = aes(x = weight, y = species)) +
geom_boxplot() +
geom_jitter(alpha = 0.2, color = "mediumpurple")11. Inspect the help file for geom_boxplot() and see how you can remove the outliers from being plotted by geom_boxplot(). Make this change in the code above!
#code block for Question 11
ggplot(data = surveys,
mapping = aes(x = weight, y = species)) +
geom_jitter(alpha = 0.2, color = "mediumpurple") +
geom_boxplot(outlier.shape = NA)Some small changes that make big differences to plots. One of these changes are better labels for a plot’s axes and legend.
10. Using the code you created in question 8, modify the x-axis and y-axis labels to describe what is being plotted. Be sure to include any necessary units!
# Changed to ridgeplot
ggplot(data = surveys,
mapping = aes(x = weight, y = species)) +
xlab("Weight (Grams)") +
ylab("Species of Animals") +
geom_density_ridges(color = "navy", fill = "mediumpurple", alpha = 0.8)Picking joint bandwidth of 2.21
pretty_palette <- c("#c7a1d1", "#09a2a3", "#abd67d", "#cc689f", "#4e53a7",
"#F6B472", "#F47851", "#A8131C")
surveys |>
ggplot(data = surveys,
mapping = aes(x = weight, y = species, color = genus)) +
xlab("Weight (Grams)") +
ylab("Species of Animals") +
geom_jitter(alpha = 0.2, color = "gray84") +
geom_boxplot(outlier.shape = NA) +
scale_color_manual(values = pretty_palette) + #or scale_colour_ghibli_d("YesterdayMedium")
theme_bw()ggplot(data = surveys,
mapping = aes(x = weight, y = species, color = genus)) +
xlab("Weight (Grams)") +
ylab("Species of Animals") +
geom_jitter(alpha = 0.2, color = "gray84") +
geom_boxplot(outlier.shape = NA) +
scale_color_manual(values = pretty_palette) +
theme_bw() +
theme(legend.position = "none") +
annotate("text", y = 1.05, x = 15, label = "Neotoma", size = 3.5) +
annotate("text", y = 2.05, x = 250, label = "Chaetodipus", size = 3.5) +
annotate("text", y = 3.05, x = 250, label = "Peromyscus", size = 3.5) +
annotate("text", y = 4.05, x = 250, label = "Perognathus", size = 3.5) +
annotate("text", y = 5.05, x = 250, label = "Reithrodontomys", size = 3.5) +
annotate("text", y = 6.05, x = 250, label = "Sigmodon", size = 3.5) +
annotate("text", y = 7.05, x = 250, label = "Onychomys", size = 3.5) +
annotate("text", y = 10.05, x = 250, label = "Dipodomys", size = 3.5)Some people (and journals) prefer for boxplots to be stacked with a specific orientation! Let’s practice changing the orientation of our boxplots.
11. Flip the orientation of your boxplots from question 10. If you created side-by-side boxplots (stacked horizontally), your boxplots should be stacked vertically. If you had vertically stacked boxplots, you should stack your boxplots horizontally!
# Code for question 11!
ggplot(data = surveys,
mapping = aes(x = weight, y = species)) +
xlab("Weight (Grams)") +
ylab("Species of Animal") +
geom_jitter(alpha = 0.2, color = "mediumpurple") +
geom_boxplot(outlier.shape = NA) +
coord_flip() + #resource: ggplot2.tidyverse.org/reference/coord_flip.html
theme(axis.text.x = element_text(angle = 45)) #resource: ggplot2.tidyverse.org/reference/theme.html